ABSTRACT

Research on three sections of the Graduate Record Examinations
(GRE) Aptitude Test was reviewed before the introduction of the
restructured test in October 1977: (1) the GRE-Verbal
section; (2) the GRE-Quantitative section; and (3) a planned third
section, measuring analytical thinking skills. Research in all three
areas focused on test reliability, validity, difficulty, speededness,
and equivalence of restructured and former test sections. The
restructured verbal measure was shortened from 75 to 50 minutes, and
included a long as well as a short reading comprehension passage.
Research on the quantitative ability test involved combinations of
three item types: regular mathematics, quantitative comparison, and
data interpretation. The restructured test was reduced from 75 to 50
minutes, and contained about thirty quantitative comparison items in
place of some regular mathematics and data interpretation items. Seven
new item types were evaluated for inclusion in the abstract/analytical
reasoning test, based upon their difficulty, reliability,
speededness, validity, appropriateness for all college majors,
efficiency, and independence from the other two tests. Three of the
seven item types were accepted for use in the new GRE: analytical
reasoning, logical diagrams, and analysis of explanations. (GDC)
SUMMARY OF RESEARCH ON RESTRUCTURING THE
GRADUATE RECORD EXAMINATIONS APTITUDE TEST


This paper reviews the research on which the introduction
(in October 1977) of the restructured Graduate Record Examinations
(GRE) Aptitude Test was based. Consideration of a new test format
began early in 1974, when the GRE Board and its Research Committee
began a systematic review of the GRE Program offerings. In April
1975, a model for further research and development of the test
format was proposed to these groups by staff and approved in
principle. The goal of this research was to broaden the Aptitude
Test and thus enable students to demonstrate a wider array of
academic talents. The research can be divided into three areas:
(1) research on the GRE verbal (GRE-V) section of the test;
(2) research on the GRE quantitative (GRE-Q) section of the test;
and (3) research to develop a third module, GRE analytical (GRE-A),
that would allow for broader skill measurement.


Research in all three areas focused on reliability, validity,
difficulty, speededness, and comparability of the restructured to
old format test sections. Technical definitions of each of these
terms are presented in depth in the GRE Technical Manual (Conrad,
Trismen, & Miller, 1977). Briefly, reliability is the extent to
which a test is consistent in measuring whatever it measures.
Validity is the extent to which a test measures what it purports
to measure. Several types of validity exist: "face" validity,
the extent to which the test questions appear to be related to the
appropriate ability; "criterion" validity, the extent to which
the test score is related to other measures taken at the same time
(e.g., the relationship of GRE scores with undergraduate grades) or
at a future time (e.g., the relationship of GRE test scores with
first-year graduate grade-point averages); and "construct" validity,
the extent to which test scores relate to other measures (e.g.,
other ability measures) in a predictable manner. Validity and
reliability are interdependent: a test can be very reliable but
not valid, while an unreliable test cannot be valid. Difficulty of
a test is measured by the proportion of examinees who answer each
question correctly. The appropriate difficulty of a test depends
on how the test is used and is related to reliability. Speededness
is the extent to which test scores are related to the time limits
of the test rather than the examinee's ability to answer the
questions. Comparability of scores refers to whether scores on two
forms of the same test have the same meaning.
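
As a rough illustration of two of the terms defined above, the
following Python sketch computes item difficulty (the proportion of
examinees answering each question correctly) and one common
internal-consistency estimate of reliability, Kuder-Richardson
formula 20 (KR-20). The choice of KR-20, the function names, and the
small response matrix are assumptions made for illustration only;
the definitions above do not say which reliability coefficient was
used in the research.

```python
from statistics import pvariance


def item_difficulty(responses):
    """Return the proportion of examinees answering each item correctly.

    `responses` is a list of rows, one per examinee, with 1 for a
    correct answer and 0 for an incorrect answer (hypothetical layout).
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_examinees
            for j in range(n_items)]


def kr20(responses):
    """Kuder-Richardson formula 20, an assumed choice of reliability index."""
    k = len(responses[0])
    p = item_difficulty(responses)
    totals = [sum(row) for row in responses]
    total_variance = pvariance(totals)  # population variance of total scores
    if k < 2 or total_variance == 0:
        raise ValueError("KR-20 needs at least 2 items and varying scores")
    item_variance_sum = sum(pj * (1 - pj) for pj in p)
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Made-up data: 5 examinees by 4 items, not taken from the GRE research.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(item_difficulty(responses))   # [0.8, 0.6, 0.4, 0.2]
print(round(kr20(responses), 2))    # 0.8
```

In this made-up example the items run from easy (80 percent correct)
to hard (20 percent correct), which is one way the difficulty of the
questions and the reliability of the total score are linked.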


In the process of investigating restructuring, it was found
that both verbal and quantitative sections could be revised while
maintaining comparability of scores and appropriate reliability,
difficulty, and speededness. Seven item formats were investigated
for inclusion in the analytical measure, from which three were
chosen for inclusion in the operational measure. Detailed summaries
of this research are presented in the following sections.


GRE Verbal


Prior to October 1977, the verbal ability measure of the GRE
Aptitude Test consisted of two sections: a 25-minute section of
discrete verbal questions including analogies, antonyms, and
sentence completions, and a 50-minute section of questions based on
six long reading passages. The verbal measure of the restructured
GRE Aptitude Test consists of one 50-minute section including
discrete verbal questions and reading comprehension questions based
on both short and long passages.


This outcome was slightly different from that anticipated at
the beginning of the research effort. Altman and Conrad (1979)
investigated the following questions concerning restructuring of
the verbal measure:


(1) Can the verbal ability measure be shortened while
    retaining appropriate reliability, difficulty,
    speededness, and validity?


(2) Can reading comprehension subscores be provided?


(3) Can reading subscores be based on different reading
    comprehension sections for students with different
    undergraduate majors?


Four 25-minute experimental tests of reading comprehension questions
were included in the October 1975 GRE national test administration.
They contained the following reading comprehension questions and
content: (1) 25 questions based on humanities and social sciences;
(2) 30 questions on humanities and social sciences; (3) 25 questions
on natural and physical sciences; and (4) 30 questions on natural
and physical sciences. Over 8,000 examinees took each of these
experimental tests.


Using item analyses of the responses to the experimental
tests, estimates of difficulty, speededness, and reliability of
each of the four experimental tests were obtained for samples of
humanities and social science majors and biological and physical
science majors.

Results of the investigation indicated that reliability,
item difficulty, and speededness of a 50-minute verbal section
would be psychometrically appropriate. Correlations with
self-reported undergraduate grades also suggested that shortening
the reading comprehension section would have no substantial effect
on validity. However, it was found that the humanities and social
sciences group and natural and physical sciences group performed
differently on the experimental tests depending on whether the
reading material was from the "appropriate" discipline. For this
reason, introducing optional reading sections would mean that
comparable verbal scores could not be provided.


Finally, the feasibility of creating a reading subscore was
considered. Although such a subscore was feasible, weaknesses
were identified by the factor analysis study (Powers, Swinton, &
Carlson, 1977). The factor analysis showed that a reading
comprehension score should include not only reading comprehension but
also sentence completion items. Although the sentence completion
items would be included based on the statistics, sentence completion
items would not have face validity as part of a reading comprehension
subscore. Another problem was the high correlation of the proposed
subscore with the total score. The idea of a reading subscore was
therefore abandoned.


Additional input on restructuring the verbal section was
obtained from student and institutional surveys. Approximately
equal percentages (72%) of students and institutions favored
offering optional reading comprehension sections if possible.
Students also favored the inclusion of short as well as long
reading comprehension passages. Although optional reading
comprehension sections were not feasible, as discussed above, both
short and long reading comprehension passages are included in the
restructured test.


GRE Quantitative


Prior to October 1977, the quantitative ability section of
the GRE Aptitude Test was 75 minutes long, containing 55 questions
on data interpretation, geometry, arithmetic, algebra, and
miscellaneous item types. The restructured section lasts 50 minutes and
contains the same number of items, with about 30 quantitative
comparison items in place of some of the regular math and data
interpretation items.


Altman and Conrad (1979) investigated the feasibility of
achieving a quantitative ability section that was only 50 minutes
in length. The research questions were:


(1) Can the quantitative ability measure be shortened
    while retaining appropriate reliability, difficulty,
    speededness, and validity?


(2) Can a subscore for data interpretation be provided?


These questions were answered by analyzing results of four
experimental tests given in the October 1975 test administration.
Over 8,000 examinees took each of the following 25-minute tests:

(1) 30 regular math questions (those currently in the test,
    excluding data interpretation items),


(2) 40 quantitative comparison questions,


(3) 35 questions including quantitative comparisons plus
    mixed regular types (designed to be similar to a module
    of the proposed test), and


(4) 20 data interpretation questions, similar to
    the second module for the proposed test.


Samples of all examinees taking each of the four experimental
tests were selected, and item difficulty, reliability, and
speededness were calculated for each sample. The second and third
tests were also evaluated for a sample of humanities and social sciences
majors and a sample of biological and physical sciences majors.
Difficulty indices indicated that for each of the item types items
could be written with appropriate difficulty for a final test form.
Speededness information showed that the quantitative comparisons
plus mixed regular items were approximately as speeded as the
operational quantitative section, with the other three experimental
test modules being more speeded. Reliability information suggested
that both quantitative comparisons and the mix of quantitative
comparisons and regular item types were about equally reliable.
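
The text does not say which speededness statistics were calculated.
As a sketch only, the Python below assumes two simple descriptive
indices that are often reported for timed tests: the proportion of
examinees who reach the last question, and the proportion who reach
at least three-quarters of the questions. The function name, the
input layout, and the data are hypothetical.

```python
import math


def speededness_summary(last_item_reached, n_items):
    """Summarize how far examinees got in a timed section.

    `last_item_reached` holds, for each examinee, the position (1-based)
    of the last question attempted. Both indices are assumed, commonly
    used descriptive measures, not the ones named in the study.
    """
    n = len(last_item_reached)
    three_quarters = math.ceil(0.75 * n_items)
    completed = sum(x >= n_items for x in last_item_reached) / n
    reached_75 = sum(x >= three_quarters for x in last_item_reached) / n
    return {"completed": completed, "reached_75pct": reached_75}


# Made-up data for a 40-question module taken by 10 examinees.
reached = [40, 40, 40, 38, 35, 30, 40, 40, 29, 40]
summary = speededness_summary(reached, n_items=40)
print(summary)  # {'completed': 0.6, 'reached_75pct': 0.9}
```

Lower values on either index would suggest that time limits, rather
than ability alone, are affecting scores on that module.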


Comparability of scores based on the existing quantitative
section and the proposed quantitative section including quantitative
comparisons was also investigated. The performance of
undergraduate majors in the humanities and social sciences and of
natural sciences majors on quantitative comparisons was compared with
their performance on the operational quantitative section.
Differences in performance were slightly magnified when the
comparison was based on all quantitative comparison material;
however, this was judged as not significant since quantitative
comparisons would only be a portion of the restructured quantitative
section.


A second measure of comparability was obtained by reviewing a
factor analytic study based on the same experimental tests (Powers,
Swinton, & Carlson, 1977). This indicated that slightly different
patterns of abilities underlie performance on regular mathematics,
data interpretation, and quantitative comparisons questions,
although differences were not considered large.


A third method of checking comparability involved correlating
each of the four experimental tests with the operational section
and correcting for attenuation (i.e., for the differences in
reliability of the experimental tests). Correlations of regular
mathematics, quantitative comparisons, quantitative comparisons
plus regular mathematics, and data interpretation experimental
modules with the quantitative score were .99, .96, .97, and .97,
respectively.
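
The formula used for the correction is not given here. A minimal
sketch, assuming the classical (Spearman) correction for attenuation,
divides the observed correlation by the square root of the product of
the two reliabilities. The function name and all numbers below are
hypothetical and are not the values from the study.

```python
from math import sqrt


def disattenuate(r_xy, r_xx, r_yy=1.0):
    """Classical correction for attenuation (assumed formula).

    r_xy: observed correlation between the two tests
    r_xx: reliability of the first test (e.g., an experimental module)
    r_yy: reliability of the second test; leave at 1.0 to correct for
          unreliability in the first test only
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / sqrt(r_xx * r_yy)


# Hypothetical values: observed r = .54, reliabilities .81 and .64.
both = disattenuate(0.54, 0.81, 0.64)
print(round(both, 2))      # 0.75

# Correcting only for the first test's reliability.
one_sided = disattenuate(0.72, 0.81)
print(round(one_sided, 2)) # 0.8
```

Because shorter modules tend to be less reliable, a correction of
this kind puts modules of different reliability on a common footing
before their correlations with the operational section are compared.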

Finally, Altman and Conrad (1979) evaluated the concurrent
validity of the proposed item types by comparing correlations of
self-reported grades with the operational quantitative score, the
quantitative comparisons module, and the quantitative comparisons
plus regular math module. Correlations ranged between .24 and .29,
with virtually no difference between the operational section and
the experimental modules.


Conrad and Altman (1979) also surveyed institutions and
students to elicit reactions to restructuring the quantitative
section of the test. Although a majority of the 1,530 students
surveyed favored shortening the quantitative section, differences
were found by major field. Almost two-thirds of the humanities
majors, approximately one-half of the social sciences majors, and
less than one-third of the natural sciences majors were in favor
of a shortened quantitative section. Ninety percent of the
institutional representatives favored shortening both verbal and
quantitative measures to provide for a new measure within current
time allotments.


To summarize, information from the Altman and Conrad (1979)
study indicated that by including quantitative comparison items
in the quantitative section, reliability could be maintained while
the amount of time necessary to complete the section could be
decreased. Too high a proportion of data interpretation items
detracted from reliability and increased speededness. Inclusion of
quantitative comparisons would not decrease validity. Results
suggest that, although the regular mathematics, data interpretation,
and quantitative comparison questions are not parallel in the
strict sense, the item types could be used together to obtain a
quantitative score comparable to the existing score. Finally,
results indicated that a data interpretation subscore based on a
25-minute module should not be provided, due to reliability
and speededness considerations.


GRE Analytical 


Throughout the discussions on restructuring the Aptitude Test,
one idea remained constant - the addition of a third module to
broaden skills measured by the Aptitude Test. In early discussions
of the third module, many options were considered. It was decided
to focus on a module requiring minimal research for the present
restructuring, while continuing to do research on theoretical
measures such as scientific thinking and documented
accomplishments.


Altman and Conrad (1979) surveyed institutions and students.
Of the possible new measures listed (abstract reasoning, scientific
thinking, and "study style"), abstract reasoning was favored by
both faculty and students. Based on this interest, interest
expressed by the Board, and the availability of item types, it was
decided to try to develop a new reasoning module.


Seven item types were identified as possible components of the
new module. These were included as experimental sections of regular
national administrations during the 1975-76 academic year. Each
of the seven modules (including mixtures of question types) is
described briefly below. For further information and sample items
see the GRE Technical Manual (Conrad, Trismen, & Miller, 1977).


(1) Letter Sets - Each item consists of five groups of
    letters, only one group of which is unlike the others
    in alphabetic arrangement. The examinee's task is to
    identify that dissimilar group. This item type originated
    in the Kit of Factor-Referenced Cognitive Tests and was
    intended to measure inductive reasoning.


(2) Logical Reasoning and Letter Sets - Logical Reasoning
    items are based on brief arguments or statements
    presenting evidence of opinions. The questions require
    recognition of unstated presuppositions, logical flaws,
    methods of persuasion, and conclusions logically following
    from arguments. From earlier pretests, it was known that
    the item type correlated highly with the verbal score.
    However, it was hypothesized that a combination of
    Logical Reasoning and Letter Sets might be appropriate as
    a measure of reasoning. Experiments in the Law School
    Admission Test Program showed that Logical Reasoning
    questions had high criterion validity.


(3) Analytical Reasoning and Letter Sets - Analytical Reasoning
    questions are based on brief sets of statements expressing
    relationships among abstract symbols (letters) or sets of
    rules governing processes or procedures having few concrete
    referents. The examinee draws inferences from and/or
    critically assesses those sets of statements. It was
    hypothesized that the combination of Letter Sets and
    Analytical Reasoning items might be appropriate for the
    new module.


(4) Evaluation of Evidence - These questions are based on a
    brief narrative establishing a situation and a conclusion
    drawn from the facts presented. The items consist of bits
    of evidence that, in relation to the situation described,
    strengthen, weaken, confirm, disprove, or fail to affect
    the conclusion.


(5) Analysis of Explanations - These questions are based on
    brief narratives establishing a situation in which an
    action is taken in order to have a specific effect. A
    later result, which may or may not be directly related to
    the action, is described in a brief statement. Each
    question is a piece of information that must be evaluated
    in terms of facts and results.


(6) Logical Diagrams - This item type is derived from Venn
    diagrams and has been used in the Kit of Factor-Referenced
    Cognitive Tests. Each item consists of three nouns, and
    the examinee is asked to select the circle diagram that
    best characterizes the relationship of the three.


(7) Deductive Reasoning - This item type consists of a
    relatively complex set of rules which the student is
    asked to apply in solving problems based on diagrams.


Research was designed to answer the following questions about
the above item types:

1) Will the item types yield material of appropriate difficulty,
   reliability, and unspeededness?


2) Will the item types measure skills that are relatively
   independent of verbal and quantitative skills?


3) Will the item types have criterion validity?


4) What combination of item types appears to be best in terms
   of (a) efficiency, (b) face validity, (c) criterion validity,
   (d) independence of V and Q, and (e) appropriateness for
   both science and humanities-social science students?



Each of the experimental tests was taken by a substantial number of
GRE test-takers. In all but one case, at least three samples were
drawn: a representative sample of students, a sample of biological
and physical sciences majors, and a sample of humanities and social
science majors. In addition, separate analyses for logical (Venn)
diagrams were based on samples of black males, black females, white
males, and white females.


Statistical analyses were used to investigate the item types'
efficiency, criterion validity, difficulty, reliability,
speededness, independence of V and Q, and appropriateness for students
with different academic backgrounds. To assess the face validity
of each item type and the way in which different groups perceived
its utility, surveys were administered to samples of students who
had taken each of the experimental item types, and two student
committees examined and reacted to samples of the item types.
Presentations were made at a number of national and regional
meetings of professional associations, and the item types were
briefly discussed by some GRE Advanced Test committees of examiners.
As a result of this work, it was possible to make a number of
preliminary decisions about the appropriateness of each of the
seven item types as a possible part of a new module.


Based on the results of both statistical analysis and
questionnaire responses, the ratings for each of the item types are
summarized in the following table, where (+) indicates a favorable
rating, (-) an unfavorable rating, and (0) a neutral rating. It
should be noted that the ratings for a given item type are relative
to the performance of the other item types as shown in the research.


[Table: ratings of Letter Sets, Logical Reasoning, *Analytical
Reasoning, Evaluation of Evidence, *Analysis of Explanations,
*Logical Diagrams, and Deductive Reasoning on Difficulty,
Efficiency, Criterion Validity, Independence, and Face Validity]


*Chosen for inclusion in the analytical ability module.


Based on this research, three of the seven item types were
accepted for use in the analytical ability module: analytical
reasoning, analysis of explanations, and logical diagrams (asterisked
item types in the table). These were chosen not only for their
individually good performance on the various criteria, but also for
their combined balance on the criteria.


However, the evaluation of evidence item type was considered
a viable alternative to the analysis of explanations item type and
therefore warranted further study. Four hypotheses were studied by
Thompson and Conrad (1979):


(1) That previous indications of student interest in evaluation
    of evidence were due primarily to the happenstance that
    the subject matter content was more interesting, not to
    characteristics of the item type itself; that students
    will find analysis of explanations and evaluation of
    evidence equally interesting if topical content of the
    experimental tests is matched.


    That experimental tests combining evaluation of evidence
    and analysis of explanations items will be more speeded
    and will result in lower scores than tests containing
    only one item type.


    That previous indications of higher validity for analysis
    of explanations than for evaluation of evidence will be
    supported by this study, using the criteria of overall
    undergraduate grade-point average in the last two years
    and undergraduate grade-point average in major field.
    (This hypothesis was made despite the fact that the
    criterion data will differ in two ways from data collected
    in the previous study: a) the data will be on the
    registration form instead of the answer sheet; and b) the
    undergraduate grade-point average data do not match the
    overall four-year grade-point average data collected
    previously.)



Six new experimental modules were spiralled and administered
at the April 1977 administration of the GRE. A sample of examinees
was asked to comment on these modules. Examinees ranked evaluation
of evidence items as more interesting than analysis of explanations.
Most examinees would prefer having both types of items.


The evaluation of evidence test was more speeded than the
analysis of explanations test. Tests composed of both item types
were more difficult and more speeded than tests containing only one
item type. Differences in validities of the two item types using the
criteria of self-reported undergraduate grades in all four years,
last two years, and major field were not marked.

Based on the Thompson and Conrad study, the decision was made
to retain the original plan for an analytical measure consisting
of analytical reasoning, logical diagrams, and analysis of
explanations item types.


Conclusions


Since October 1977, the restructured GRE Aptitude Test has
been administered. This test produces three scores: verbal,
quantitative, and analytical ability. Further research on the
analytical score is currently in progress, and institutions are
advised to withhold the analytical score from use in the selection
process until the relationship between the score and the performance
of graduate students is established. Further research on the first
operational year of the restructured Aptitude Test will be published
as it becomes available.